Automatic C-to-CUDA Code Generation for Affine Programs
Authors
Abstract
Graphics Processing Units (GPUs) offer tremendous computational power. CUDA (Compute Unified Device Architecture) provides a multi-threaded parallel programming model that facilitates high-performance implementations of general-purpose computations. However, the explicitly managed memory hierarchy and the multi-level parallel view make manual development of high-performance CUDA code rather complicated. Hence, the automatic transformation of sequential input programs into efficient parallel CUDA programs is of considerable interest. This paper describes an automatic code transformation system that generates parallel CUDA code from sequential C input for regular (affine) programs. Using and adapting publicly available tools that have made polyhedral compiler optimization practically effective, we develop a C-to-CUDA transformation system that generates two-level parallel CUDA code optimized for efficient data access. The performance of the automatically generated code is compared with that of manually optimized CUDA code on a number of benchmarks. The automatically generated CUDA code performs quite close to the hand-optimized CUDA code and considerably better than the same benchmarks run on a multicore CPU.
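The following minimal sketch makes the "two-level parallel" structure concrete. It shows what such a transformation might do to a simple affine loop, written in plain C so it runs without a GPU. It is not the paper's generated code. The 3-point stencil, the tile size `BLOCK`, the bound `MAX_N`, and the function names (`stencil_seq`, `stencil_two_level`, `versions_agree`) are hypothetical. In the sketch, the outer loop plays the role of CUDA thread blocks (`blockIdx.x`) and the inner loop plays the threads within a block (`threadIdx.x`). The local array `tile` stands in for `__shared__` memory that is staged once per block before computation.

```c
#include <assert.h>
#include <string.h>

#define BLOCK 64    /* hypothetical tile size, i.e. "threads per block" */
#define MAX_N 4096  /* hypothetical upper bound on the problem size for this demo */

/* Sequential affine input (hypothetical example): 3-point stencil over 1 <= i < n-1. */
void stencil_seq(const float *a, float *out, int n) {
    for (int i = 1; i < n - 1; i++)
        out[i] = a[i - 1] + a[i] + a[i + 1];
}

/* Two-level form: loop b mimics blockIdx.x, loop t mimics threadIdx.x.
   The local array tile stands in for CUDA __shared__ memory. */
void stencil_two_level(const float *a, float *out, int n) {
    int num_blocks = (n + BLOCK - 1) / BLOCK;   /* ceil(n / BLOCK) blocks */
    for (int b = 0; b < num_blocks; b++) {
        float tile[BLOCK + 2];                  /* BLOCK elements plus a 1-element halo per side */
        int base = b * BLOCK;
        /* Load phase: stage a[base-1 .. base+BLOCK] once; out-of-range entries become 0. */
        for (int t = 0; t < BLOCK + 2; t++) {
            int g = base + t - 1;
            tile[t] = (g >= 0 && g < n) ? a[g] : 0.0f;
        }
        /* A CUDA kernel would need __syncthreads() here before reading the tile. */
        for (int t = 0; t < BLOCK; t++) {
            int i = base + t;                   /* global index handled by "thread" t */
            if (i >= 1 && i < n - 1)            /* same bounds as the sequential loop */
                out[i] = tile[t] + tile[t + 1] + tile[t + 2];  /* a[i-1] + a[i] + a[i+1] */
        }
    }
}

/* Hypothetical check: runs both versions on the same input and reports exact agreement. */
int versions_agree(int n) {
    static float a[MAX_N], ref[MAX_N], got[MAX_N];
    assert(n >= 3 && n <= MAX_N);
    for (int i = 0; i < n; i++) {
        a[i] = (float)(i % 7);
        ref[i] = 0.0f;
        got[i] = 0.0f;
    }
    stencil_seq(a, ref, n);
    stencil_two_level(a, got, n);
    return memcmp(ref, got, (size_t)n * sizeof(float)) == 0;
}
```

In a real CUDA kernel, the two loops over `b` and `t` would become the grid launch itself. The one-element halo on each side is what lets every thread read its neighbours from the staged tile instead of from global memory. Calling `versions_agree(1000)` exercises a problem size that is not a multiple of `BLOCK`, which is where the guard on `i` matters.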
Similar resources
MetaFork: a compilation framework for concurrency models targeting hardware accelerators and its application to the generation of parametric CUDA kernels
In this paper, we present the accelerator model of MetaFork together with the software framework that allows automatic generation of CUDA code from annotated MetaFork programs. One of the key features of this CUDA code generator is that it supports the generation of CUDA kernel code where program parameters (like number of threads per block) and machine parameters (like shared memory size) are ...
Parametric GPU Code Generation for Affine Loop Programs
Partitioning a parallel computation into finitely sized chunks for effective mapping onto a parallel machine is a critical concern for source-to-source compilation. In the context of OpenCL and CUDA, this translates to the definition of a uniform hyper-rectangular partitioning of the parallel execution space where each partition is subject to a fine-grained distribution of resources that has a ...
PIPS Is not (just) Polyhedral Software: Adding GPU Code Generation in PIPS
Parallel and heterogeneous computing are growing in audience thanks to the increased performance brought by ubiquitous manycores and GPUs. However, available programming models, like OpenCL or CUDA, are far from being straightforward to use. As a consequence, several automated or semi-automated approaches have been proposed to automatically generate hardware-level code from high-level sequenti...
Speculative Execution of Parallel Programs with Precise Exception Semantics on GPUs
General-purpose computing on GPUs (GPGPU) can enable significant performance and energy improvements for certain classes of applications. However, current GPGPU programming models, such as CUDA and OpenCL, are only accessible by systems experts through low-level C/C++ APIs. In contrast, large numbers of programmers use high-level languages, such as Java, due to their productivity advantages of ty...
Automatic Transformations for Effective Parallel Execution on Intel Many Integrated Core
We demonstrate in this work the potential effectiveness of a source-to-source framework for automatically optimizing a sub-class of affine programs on the Intel Many Integrated Core Architecture. Data locality is achieved through complex and automated loop transformations within the polyhedral framework to enable parallel tiling, and the resulting tiles are processed by an aggressive automatic ...